
Based on material from the Georgetown Health Informatics and Data Science program and licensed under CC4.0
In this workshop we will explore how to classify non-linear data using linear classifiers such as Logistic Regression and SVM.
We will start by loading some of the packages that will help us organize and visualize the data. Other packages will be loaded as necessary.
library(ggplot2)
library(dplyr)
library(tidyr)
# Set default plot theme (equivalent to seaborn's default in Python)
theme_set(theme_minimal())
# Set plot size and general theme options for all plots
options(repr.plot.width = 10, repr.plot.height = 10)
We need to write a function to generate linearly separable synthetic data. We do this using the "mlbench" library to generate our two distributions. For more, see: https://www.rdocumentation.org/packages/mlbench/versions/2.1-6
library(MASS)
install.packages("mlbench")
library(mlbench)
set.seed(2)
The "2dnormals" function is used to generate samples with 200 datapoints.
# Use mlbench.2dnormals to create two informative features and two classes
n_samples <- 200
n_clusters_per_class <- 1
data <- mlbench.2dnormals(n_samples, 2) # 2 classes and 2 informative features
# Extract features and labels
X <- data$x
y <- as.factor(data$classes)
# Create a data frame for plotting
df <- data.frame(X1 = X[, 1], X2 = X[, 2], label = y)
Here we can see the distribution of the 2 classes of data we generated.
ggplot(df, aes(x = X1, y = X2, color = label)) +
geom_point(size = 3) +
labs(title = "Scatter Plot of Classification Data", x = "Feature 1", y = "Feature 2") +
theme_minimal()
We can explore the structure of the dataset using dim(), length(), and head().
dim(X)
length(y)
head(X)
head(y)
We can write a function to plot the 2D dataset as well.
visualize_2d <- function(X, y, title = "Data in 2D") {
# Convert to a data frame
df <- data.frame(X1 = X[, 1], X2 = X[, 2], label = as.factor(y))
# Create the plot
plot <- ggplot(df, aes(x = X1, y = X2, color = label)) +
geom_point(size = 3) +
labs(title = title, x = "x1", y = "x2") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
return(plot) # Return the ggplot object (equivalent to fig in Python)
}
This lets us save our plot as an object in R.
fig <- visualize_2d(X, y)
fig
We can also write our own function to generate the data. Below, we use built-in functions to generate the different classes of our dataset. The number of classes corresponds to the number of random points we want to use as the central points of the distributions. We can also increase the number of features in the dataset beyond 2.
make_blobs <- function(n_samples, n_features, centers, cluster_std) {
blobs <- list()
labels <- rep(NA, n_samples)
samples_per_center <- n_samples / centers
for (i in 1:centers) {
# Create blob with multivariate normal distribution
mu <- rnorm(n_features, mean = i * 2)
sigma <- diag(cluster_std, n_features)
blob <- mvrnorm(n = samples_per_center, mu = mu, Sigma = sigma)
blobs[[i]] <- blob
labels[((i - 1) * samples_per_center + 1):(i * samples_per_center)] <- i
}
X <- do.call(rbind, blobs)
return(list(X = X, labels = labels))
}
blobs <- make_blobs(n_samples = 200, n_features = 2, centers = 2, cluster_std = 0.60)
X <- blobs$X
y <- blobs$labels
We can use the previously written function to visualize the 2 features for each point in the dataset, colored by their class label.
fig <- visualize_2d(X, y)
print(fig)
Having visualized the data, let's fit a Logistic Regression model on this 2D data.
df <- data.frame(X1 = X[, 1], X2 = X[, 2], label = as.factor(y))
# Fit the logistic regression model
logistic_model <- glm(label ~ X1 + X2, data = df, family = "binomial")
# Summary of the model
summary(logistic_model)
We need to write a function that will plot the decision boundary on the 2D plot of the data. First, a grid of xy points is generated. Using the trained model, a prediction is made for each point on the grid and used to draw the decision boundary.
# Function to plot decision boundary in R
plot_lr_decision_boundary <- function(model, df) {
# Create a grid of values
xlim <- range(df$X1)
ylim <- range(df$X2)
# Create grid of values
grid <- expand.grid(X1 = seq(xlim[1], xlim[2], length.out = 100),
X2 = seq(ylim[1], ylim[2], length.out = 100))
# Predict probabilities for grid
grid$prob <- predict(model, newdata = grid, type = "response")
# Plot data and decision boundary using ggplot2
library(ggplot2)
p <- ggplot(df, aes(x = X1, y = X2, color = label)) +
geom_point() +
geom_contour(data = grid, aes(z = prob), breaks = 0.5, color = "black") +
labs(title = "Logistic Regression Decision Boundary") +
theme_minimal()
print(p)
}
Visualize the decision boundary. We can see the result is a line that separates the two labels in the dataset.
# Example usage:
# Assume 'df' is the dataframe and 'logistic_model' is the fitted model
plot_lr_decision_boundary(logistic_model, df)
A feature cross is a synthetic feature that encodes nonlinearity in the feature space by multiplying two or more input features together. Example: x3 = x1*x2
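As a tiny illustration (the numbers below are made up purely for this example), a feature cross in R is just element-wise multiplication of two feature columns:
# Toy example of a feature cross: x3 is the product of x1 and x2
x1 <- c(1, -2, 3)
x2 <- c(4, 5, -6)
x3 <- x1 * x2
data.frame(x1, x2, x3)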
To generate a non-linearly separable 2D dataset, we create labels for a normally distributed set of 2D points based on their quadrant in the 2D plane (an XOR pattern).
create_xor_dataset <- function(num_samples = 100) {
set.seed(0)
X <- matrix(rnorm(num_samples * 2), ncol = 2)
y_xor <- xor(X[, 1] > 0, X[, 2] > 0)
y <- ifelse(y_xor, 1, 0)
return(list(X = X, y = y))
}
xor_data <- create_xor_dataset(200)
X <- xor_data$X
y <- xor_data$y
# Check first 10 labels
head(y, 10)
We can plot the XOR dataset, and see the distribution of the labels.
# Visualize in 2D
df <- data.frame(X1 = X[, 1], X2 = X[, 2], label = as.factor(y))
ggplot(df, aes(x = X1, y = X2, color = label)) +
geom_point() +
labs(title = "XOR Dataset") +
theme_minimal()
We first fit an LR model without generating a feature cross.
# Convert the generated data to a dataframe for R
df <- data.frame(X1 = X[, 1], X2 = X[, 2], label = as.factor(y))
# Fit the logistic regression model
lr_model <- glm(label ~ X1 + X2, data = df, family = "binomial")
# Summary of the model
summary(lr_model)
When we plot the decision boundary, we see the model doesn't do a good job of separating the labels.
plot_lr_decision_boundary(lr_model,df)
We transform the 2D data to a higher dimension by adding a feature cross: the third feature is the product of the first two features.
x3 <- X[, 1] * X[, 2]
length(x3)
head(x3,10)
X_df <- data.frame(x1 = X[, 1], x2 = X[, 2], x3 = x3, y = y)
head(X_df)
X_transformed <- as.matrix(X_df[, c("x1", "x2", "x3")])
Now we have a dataset with 3 features, one of them being our feature cross.
dim(X) # Dimensions of the original X
dim(X_transformed) # Dimensions of X_transformed
# Install plotly if not installed
if (!requireNamespace("plotly", quietly = TRUE)) {
install.packages("plotly")
}
library(plotly)
visualize_3d <- function(X, y) {
fig <- plot_ly(x = ~X[, 1], y = ~X[, 2], z = ~X[, 3],
color = ~as.factor(y), colors = c("#636EFA", "#EF553B"),
type = "scatter3d", mode = "markers")
fig <- fig %>% layout(scene = list(xaxis = list(title = 'x1'),
yaxis = list(title = 'x2'),
zaxis = list(title = 'x3')))
fig
}
if (!requireNamespace("scatterplot3d", quietly = TRUE)) {
install.packages("scatterplot3d")
}
library(scatterplot3d)
visualize_3d <- function(X, y, angles = c(0, 45, 90, 135, 180)) {
colors <- c("#636EFA", "#EF553B")
color_labels <- colors[as.factor(y)]
par(mfrow = c(2, 3)) # Set up the plotting area to display multiple plots
for (angle in angles) {
scatterplot3d(X[, 1], X[, 2], X[, 3], color = color_labels, pch = 16,
xlab = "x1", ylab = "x2", zlab = "x3", main = paste("Angle:", angle),
angle = angle)
}
par(mfrow = c(1, 1)) # Reset the plotting area
}
visualize_3d(X_transformed,y)
We can fit the LR model with our feature cross added to the data.
# Install necessary packages if not already installed
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
if (!requireNamespace("e1071", quietly = TRUE)) {
install.packages("e1071") # for logistic regression
}
library(caret)
library(e1071)
# Fit logistic regression model in 2D and 3D
lr_model_2d <- glm(y ~ x1 + x2, data = X_df, family = binomial)
lr_model_3d <- glm(y ~ x1 + x2 + x3, data = X_df, family = binomial)
We can compare the performance of the model on the 2D and 3D datasets by setting a threshold of 0.5 on the predicted probability.
# Predict for 2D and 3D models
y1 <- ifelse(predict(lr_model_2d, type = "response") > 0.5, 1, 0)
y2 <- ifelse(predict(lr_model_3d, type = "response") > 0.5, 1, 0)
We can build a confusion matrix of the predictions and see that the 3D model performs much better than the 2D model.
# Classification report equivalent: confusion matrix and metrics
conf_matrix_2d <- confusionMatrix(as.factor(y1), as.factor(y))
conf_matrix_3d <- confusionMatrix(as.factor(y2), as.factor(y))
# Print results
conf_matrix_2d
conf_matrix_3d
# Print detailed metrics
print("2D Logistic Regression Performance")
print(conf_matrix_2d$byClass) # Includes precision, recall, F1, and support
print("3D Logistic Regression Performance")
print(conf_matrix_3d$byClass) # Includes precision, recall, F1, and support
To make things easier, we can write a function to get the results of the GLM training.
get_training_results_glm <- function(X, y) {
# Create a data frame with X (features) and y (labels)
df <- as.data.frame(cbind(X, y))
# Fit logistic regression model
model <- glm(y ~ ., data = df, family = binomial)
# Predict class probabilities on training data
y_pred_prob <- predict(model, df, type = "response")
# Convert probabilities to binary labels (threshold = 0.5)
y_pred <- ifelse(y_pred_prob > 0.5, 1, 0)
# Confusion matrix and classification report
results <- confusionMatrix(as.factor(y_pred), as.factor(y), mode = "everything")
# Print classification metrics (precision, recall, F1-score, etc.)
print(results)
}
Let's try another dataset; this time the nonlinear data will be labeled by the distance from the origin in the 2D plane.
make_circles <- function(n_samples = 200, factor = 0.3, noise = 0.3) {
theta <- runif(n_samples, 0, 2 * pi)
r <- c(rep(1, n_samples / 2), rep(factor, n_samples / 2))
X <- r * cbind(cos(theta), sin(theta)) + matrix(rnorm(n_samples * 2, 0, noise), ncol = 2)
y <- c(rep(0, n_samples / 2), rep(1, n_samples / 2))
data.frame(X1 = X[, 1], X2 = X[, 2], label = y)
}
# Generate circular dataset
data <- make_circles(n_samples = 200, factor = 0.3, noise = 0.3)
ggplot(data, aes(x = X1, y = X2, color = as.factor(label))) +
geom_point(size = 3) +
labs(title = "Data in 2D", x = "x1", y = "x2", color = "Label") +
theme_minimal()
Our new, non-linear feature is generated by calculating e raised to the negative sum of squares of the two features (a radial basis function).
# Calculate new feature x3 as the radial basis function (exp(-sum of squares))
X <- data
X_transformed <- data %>%
mutate(x3 = exp(-(X1^2 + X2^2)))
# Display the first few rows of the transformed dataset
head(X_transformed)
head(X)
X_transformed <- X_transformed %>% select(X1, X2, x3) %>% as.matrix()
dim(X[,-3]) #remove y
dim(X_transformed) #y removed
visualize_3d(X_transformed, data$label)
As with the previous example, we can see that adding the additional feature improves the performance of the GLM.
X_df<-X[,-3]
get_training_results_glm(X_df, data$label)
get_training_results_glm(X_transformed, data$label)
Another way to add in non-linear features is to generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
We define the following polynomial features of degree $d$ for two variables $x_1$ and $x_2$:
$$\large \{x_1^d, x_1^{d-1}x_2, \ldots, x_2^d\} = \{x_1^i x_2^j\}_{i+j=d,\; i,j \in \mathbb{N}}$$
For example, for $d=3$, this will be the following features:
$$\large 1, x_1, x_2, x_1^2, x_1x_2, x_2^2, x_1^3, x_1^2x_2, x_1x_2^2, x_2^3$$
poly <- function(X, degree) {
model.matrix(~ poly(X1, X2, degree = degree, raw = TRUE), data = X)
}
# This code only works in some versions of R
set.seed(0)
X_df <- X[,-3]
X_df <- as.data.frame(X_df)
# Manually generate polynomial features up to degree 2
X_transformed <- data.frame(
X1_1 = 1,
X1 = X_df$X1,
X2 = X_df$X2,
X1_squared = X_df$X1^2,
X2_squared = X_df$X2^2,
X1_X2 = X_df$X1 * X_df$X2
)
We write a function that will loop through all the current features, and generate all the polynomial combinations.
create_polynomial_features <- function(X, degree) {
n <- nrow(X)
poly_features <- do.call(cbind, lapply(0:degree, function(d) {
if (d == 0) return(rep(1, n)) # Intercept term
if (d == 1) return(X) # Linear terms
return(sapply(1:ncol(X), function(i) X[, i]^d))
}))
# Adding interaction terms
if (degree > 1) {
for (i in 1:ncol(X)) {
for (j in i:ncol(X)) {
if (i != j) {
poly_features <- cbind(poly_features, X[, i] * X[, j])
}
}
}
}
return(poly_features)
}
We can use the prior 2D dataset as an example.
head(X,10)
The transformed dataset has added several new features.
X_transformed <- create_polynomial_features(X_df,2) #or we can use the create polynomial function
colnames(X_transformed) <- c("X1_1","X1","X2","X1_squared","X2_squared","X1_X2")
# View the first 10 transformed rows
X_transformed[1:10, ]
# Get feature names (R doesn't have direct 'get_feature_names_out', but we can infer)
colnames(X_transformed)
colnames(X) <- c("X1","X2","y")
y = X$y
log_reg_model <- glm(y ~ X1_1 + X1 + X2 + X1_squared + X2_squared + X1_X2,
data = X_transformed, family = binomial)
visualize_2d(X, y)
log_reg_model
log_reg_model <- glm(y ~ X1 + X2 + X1_squared + X2_squared + X1_X2,
data = X_transformed, family = binomial)
# Function to plot decision boundary
plot_lr_decision_boundary <- function(model, X) {
# Create grid for predictions
x1_vals <- seq(min(X[, 1]), max(X[, 1]), length.out = 100)
x2_vals <- seq(min(X[, 2]), max(X[, 2]), length.out = 100)
grid <- expand.grid(X1 = x1_vals, X2 = x2_vals)
# Add polynomial features to the grid
grid$X1_squared <- grid$X1^2
grid$X2_squared <- grid$X2^2
grid$X1_X2 <- grid$X1 * grid$X2
# Predict probabilities on the grid
grid$prob <- predict(model, newdata = grid, type = "response")
# Plot decision boundary and data points
ggplot() +
geom_contour(data = grid, aes(x = X1, y = X2, z = prob), breaks = c(0.5), color = "black") +
geom_point(data = data.frame(X1 = X[, 1], X2 = X[, 2], y = as.factor(y)), aes(x = X1, y = X2, color = y), size = 3) +
scale_color_manual(values = c("red", "blue")) +
labs(title = "Decision Boundary", x = "x1", y = "x2") +
theme_minimal()
}
# Plot decision boundary
plot_lr_decision_boundary(log_reg_model, X)
# Summary of the logistic regression model, including coefficients
summary(log_reg_model)
# Get coefficients directly
coef(log_reg_model)
names(X_transformed)
# Get the coefficients of the logistic regression model
coef(log_reg_model)
# We have to drop X1_1, the constant column of 1s, since it is collinear with the model's intercept and gives an NA coefficient
Some more example synthetic datasets
xor_data <- create_xor_dataset(200)
X <- xor_data$X
y <- xor_data$y
X_df <- as.data.frame(X)
colnames(X_df) <- c("x1","x2")
X_transformed <- data.frame(
X1_1 = 1,
X1 = X_df$x1,
X2 = X_df$x2,
X1_squared = X_df$x1^2,
X2_squared = X_df$x2^2,
X1_X2 = X_df$x1 * X_df$x2
)
# Fit a logistic regression model with polynomial features
log_reg_model <- glm(y ~ X1_1 + X1 + X2 + X1_squared + X2_squared + X1_X2,
data = X_transformed, family = binomial)
summary(log_reg_model)
#drop X1_1, which is the 1 value
log_reg_model <- glm(y ~ X1 + X2 + X1_squared + X2_squared + X1_X2,
data = X_transformed, family = binomial)
summary(log_reg_model)
fig = visualize_2d(X,y)
fig
plot_lr_decision_boundary(log_reg_model,X)
names(X_transformed)
coef(log_reg_model)
We demonstrated how polynomial features allow linear models to build nonlinear separating surfaces. Let's now show this visually.
Let's see how regularization affects the quality of classification on a dataset on microchip testing from Andrew Ng's course on machine learning. We will use logistic regression with polynomial features and vary the regularization parameter $C$. First, we will see how regularization affects the separating border of the classifier and intuitively recognize under- and overfitting.
# Load necessary libraries
library(readr)
# Download the data
url <- "https://raw.githubusercontent.com/ICBI/AIMAHEAD_GU_publicCourseData/main/ML_Concepts/microchip_tests.txt"
#"https://media.githubusercontent.com/media/ICBI/AIMAHEAD_GU_publicCourseData/refs/heads/main/ML_Concepts/microchip_tests.txt"
download.file(url, destfile = "microchip_tests.txt")
# Read the data
data <- read_csv("microchip_tests.txt", col_names = c("test1", "test2", "released"))
head(data)
dim(data)
# Counting occurrences in 'released' column
table(data$released)
# Extracting 'test1' and 'test2' as matrix (equivalent to X)
X <- as.matrix(data[, c("test1", "test2")])
# Extracting 'released' as a vector (equivalent to y)
y <- data$released
dim(X)
length(y)
fig <- visualize_2d(X, y)
fig
Let's train logistic regression with regularization parameter $C = 10^{-2}$.
install.packages("glmnet")
# Load necessary libraries
library(glmnet)
library(ggplot2)
# Function to create polynomial features
create_polynomial_features <- function(X, degree) {
n <- nrow(X)
poly_features <- do.call(cbind, lapply(0:degree, function(d) {
if (d == 0) return(rep(1, n)) # Intercept term
if (d == 1) return(X) # Linear terms
return(sapply(1:ncol(X), function(i) X[, i]^d))
}))
# Adding interaction terms
if (degree > 1) {
for (i in 1:ncol(X)) {
for (j in i:ncol(X)) {
if (i != j) {
poly_features <- cbind(poly_features, X[, i] * X[, j])
}
}
}
}
return(poly_features)
}
# Prepare X and y
X <- as.matrix(data[, 1:2]) # Only using the features
y <- as.factor(data$released) # Target variable
# Create polynomial features of degree 7
X_transformed <- create_polynomial_features(X, 7)
# Fit the logistic regression model (C = 1e-2; glmnet's lambda penalty is roughly the inverse of C, so lambda = 1 / C = 1e2)
log_reg_model <- glmnet(X_transformed, as.numeric(y) - 1, family = "binomial", alpha = 0, lambda = 1e2)
# Create a grid for prediction
x1_range <- seq(min(X[, 1]), max(X[, 1]), length.out = 100)
x2_range <- seq(min(X[, 2]), max(X[, 2]), length.out = 100)
grid <- expand.grid(test1 = x1_range, test2 = x2_range) # Ensure correct column names
# Create polynomial features for the grid
grid_transformed <- create_polynomial_features(as.matrix(grid), 7)
# Predict probabilities using the model
grid$prob <- predict(log_reg_model, newx = grid_transformed, type = "response")
# Create a base plot
plot <- ggplot(data, aes(x = test1, y = test2, color = released)) +
geom_point(size = 3, alpha = 0.6) +
labs(color = "True Label") +
theme_minimal()
# Add the decision boundary using contour plot
plot <- plot +
geom_contour(data = grid, aes(z = prob), breaks = 0.5, color = "black")
# Show the plot
print(plot)
get_training_results <- function(model, X, y) {
# Predictions
y_pred <- predict(model, X, type = "response") # Use type = "response" for probabilities
y_pred_classes <- ifelse(y_pred > 0.5, 1, 0) # Convert probabilities to class labels
# Print confusion matrix
y_levels <- unique(c(y, y_pred_classes)) # Ensure levels are consistent
y_pred_factor <- factor(y_pred_classes, levels = y_levels)
y_factor <- factor(y, levels = y_levels)
# Print classification report
confusion_matrix <- confusionMatrix(y_pred_factor, y_factor)
print(confusion_matrix)
}
# Using the function on the logistic regression pipeline
get_training_results(log_reg_model, X_transformed, y)
y_pred <- predict(log_reg_model, X_transformed, type = "response") # Use type = "response" for probabilities
y_pred_classes <- ifelse(y_pred > 0.5, 1, 0) # Convert probabilities to class labels
accuracy <- sum(y_pred_classes == y) / length(y)
cat("Accuracy on training set:", round(accuracy, 3), "\n")
We could now try increasing $C$ to 1. In doing this, we weaken regularization, and the solution can now have greater values (in absolute value) of model weights than previously. Now the accuracy of the classifier on the training set improves.
# Fit the logistic regression model (C = 1, so lambda = 1 / C = 1)
log_reg_model <- glmnet(X_transformed, as.numeric(y) - 1, family = "binomial", alpha = 0, lambda = 1)
y_pred <- predict(log_reg_model, X_transformed, type = "response") # Use type = "response" for probabilities
y_pred_classes <- ifelse(y_pred > 0.5, 1, 0) # Convert probabilities to class labels
accuracy <- sum(y_pred_classes == y) / length(y)
cat("Accuracy on training set:", round(accuracy, 3), "\n")
# Create a grid for prediction
x1_range <- seq(min(X[, 1]), max(X[, 1]), length.out = 100)
x2_range <- seq(min(X[, 2]), max(X[, 2]), length.out = 100)
grid <- expand.grid(test1 = x1_range, test2 = x2_range) # Ensure correct column names
# Create polynomial features for the grid
grid_transformed <- create_polynomial_features(as.matrix(grid), 7)
# Predict probabilities using the model
grid$prob <- predict(log_reg_model, newx = grid_transformed, type = "response")
# Create a base plot
plot <- ggplot(data, aes(x = test1, y = test2, color = released)) +
geom_point(size = 3, alpha = 0.6) +
labs(color = "True Label") +
theme_minimal()
# Add the decision boundary using contour plot
plot <- plot +
geom_contour(data = grid, aes(z = prob), breaks = 0.5, color = "black")
# Show the plot
print(plot)
Then, why don't we increase $C$ even more - up to 10,000? Now, regularization is clearly not strong enough, and we see overfitting. Note that, with $C$=1 and a "smooth" boundary, the share of correct answers on the training set is not much lower than here. But one can easily imagine how our second model will work much better on new data.
# Fit logistic regression with weak regularization (equivalent to C = 1e4)
lambda <- 1 / 1e4 # In glmnet, lambda is the inverse of C
log_reg_model <- glmnet(X_transformed, as.numeric(y) - 1, family = "binomial", alpha = 0, lambda = lambda)
# Make predictions
y_pred <- predict(log_reg_model, newx = X_transformed, type = "response")
visualize_2d_decisionboundary <- function(X, y, model, degree) {
# Convert X to data frame
X_df <- data.frame(x1 = X[,1], x2 = X[,2], y = as.factor(y))
# Create a grid of values to predict
grid <- expand.grid(x1 = seq(min(X[,1]), max(X[,1]), length.out = 100),
x2 = seq(min(X[,2]), max(X[,2]), length.out = 100))
# Add polynomial features for the grid points
grid_poly <- create_polynomial_features(as.matrix(grid), degree)
# Get decision boundary by predicting probabilities
grid$prob <- predict(model, newx = grid_poly, type = "response")
# Plot the data points
p <- ggplot(X_df, aes(x = x1, y = x2, color = y)) +
geom_point(size = 3) +
geom_contour(data = grid, aes(z = prob), breaks = 0.5, color = "black") +
ggtitle("Decision Boundary with Polynomial Features (degree = 7)")
print(p)
}
# Visualize the decision boundary
visualize_2d_decisionboundary(X, y, log_reg_model, degree = 7)
y_pred <- predict(log_reg_model, X_transformed, type = "response") # Use type = "response" for probabilities
y_pred_classes <- ifelse(y_pred > 0.5, 1, 0) # Convert probabilities to class labels
accuracy <- sum(y_pred_classes == y) / length(y)
cat("Accuracy on training set:", round(accuracy, 3), "\n")
To discuss the results, let's rewrite the function that is optimized in logistic regression in the form:
$$\large J(X,y,w) = \mathcal{L} + \frac{1}{C}||w||^2,$$
where $\mathcal{L}$ is the logistic loss summed over the entire dataset and $C$ is the inverse regularization coefficient (the very same $C$ used in sklearn's implementation of LogisticRegression).
The larger the parameter $C$, the more complex the relationships in the data that the model can recover (intuitively, $C$ corresponds to the "complexity" of the model - model capacity).
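As a rough sketch of this relationship (assuming X_transformed and y from the cells above are still loaded, and using the approximate mapping lambda = 1/C), we can refit the model for a few values of C and compare training accuracy:
# Sweep a few values of C; smaller lambda (larger C) means weaker regularization
for (C in c(1e-2, 1, 1e4)) {
model_C <- glmnet(X_transformed, as.numeric(y) - 1, family = "binomial", alpha = 0, lambda = 1 / C)
pred_C <- ifelse(predict(model_C, newx = X_transformed, type = "response") > 0.5, 1, 0)
cat("C =", C, "-> training accuracy:", round(mean(pred_C == (as.numeric(y) - 1)), 3), "\n")
}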
Generate linearly separable data
set.seed(0)
blobs <- make_blobs(n_samples = 50, n_features = 2, centers = 2, cluster_std = 0.60)
X <- blobs$X
y <- blobs$labels
A linear discriminative classifier would attempt to draw a straight line separating the two sets of data, and thereby create a model for classification. For two dimensional data like that shown here, this is a task we could do by hand. But immediately we see a problem: there is more than one possible dividing line that can perfectly discriminate between the two classes!
We can draw them as follows:
visualize_2d <- function(X, y, title = "Data in 2D") {
# Convert to a data frame
df <- data.frame(X1 = X[, 1], X2 = X[, 2], label = as.factor(y))
# Create the plot
plot <- ggplot(df, aes(x = X1, y = X2, color = label)) +
geom_point(size = 3) +
labs(title = title, x = "x1", y = "x2") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
return(plot) # Return the ggplot object (equivalent to fig in Python)
}
visualize_2d(X,y)
# Combine into a dataframe
X_df <- data.frame(X1 = X[,1], X2 = X[,2], y = y)
ggplot(X_df, aes(x = X1, y = X2, color = y)) +
geom_point(size = 4) +
annotate("point", x = 4, y = 3.5, color = "red", shape = 4, size = 4, stroke = 2) +
geom_abline(slope = -1, intercept = 7, color = "black") +
geom_abline(slope = 2.5, intercept = -9.8, color = "black") +
geom_abline(slope = -0.6, intercept = 5.3, color = "black") +
labs(title = "Scatterplot with Decision Boundaries", x = "X1", y = "X2") +
theme_minimal()
These are three very different separators which, nevertheless, perfectly discriminate between these samples. Depending on which you choose, a new data point (e.g., the one marked by the "X" in this plot) will be assigned a different label! Evidently our simple intuition of "drawing a line between classes" is not enough, and we need to think a bit deeper.
SVM: Maximizing the Margin
Support vector machines offer one way to improve on this. The intuition is this: rather than simply drawing a zero-width line between the classes, we can draw around each line a margin of some width, up to the nearest point. Here is an example of how this might look:
# Define x values for decision boundaries
x_vals <- seq(min(X_df$X1) - 0.5, max(X_df$X1) + 0.5, length.out = 100)
# Create data for the boundary ribbons
ribbon_data <- data.frame(
x = rep(x_vals, 3),
ymin = c(-1 * x_vals + 7 - 0.5, 2.5 * x_vals - 9.8 - 0.5, -0.6 * x_vals + 5.3 - 0.5),
ymax = c(-1 * x_vals + 7 + 0.5, 2.5 * x_vals - 9.8 + 0.5, -0.6 * x_vals + 5.3 + 0.5),
group = factor(rep(1:3, each = length(x_vals))) # Group for the ribbons
)
p <- ggplot(X_df, aes(x = X[,1], y = X[,2], color = y)) +
geom_point(size = 4) + # Scatter plot
annotate("point", x = 4, y = 3.5, color = "red", shape = 4, size = 4, stroke = 2) +
# Add shaded areas
geom_ribbon(data = ribbon_data, aes(x = x, ymin = ymin, ymax = ymax, fill = group),
inherit.aes = FALSE, alpha = 0.4) +
# Add decision boundary lines and shaded areas
geom_abline(slope = -1, intercept = 7, color = "black") +
#geom_ribbon(data = data.frame(x = c(-1, 3.5)), aes(x = x, ymin = -1 * x + 7 - 0.33, ymax = -1 * x + 7 + 0.33), fill = "#AAAAAA", alpha = 0.4) +
geom_abline(slope = 2.5, intercept = -9.8, color = "black") +
#geom_ribbon(data = data.frame(x = c(-1, 3.5)), aes(x = x, ymin = 2.5 * x - 9.8 - 0.55, ymax = 2.5 * x - 9.8 + 0.55), fill = "#AAAAAA", alpha = 0.4) +
geom_abline(slope = -0.6, intercept = 5.3, color = "black") +
#geom_ribbon(data = data.frame(x = c(-1, 3.5)), aes(x = x, ymin = -0.6 * x + 5.3 - 0.2, ymax = -0.6 * x + 5.3 + 0.2), fill = "#AAAAAA", alpha = 0.4) +
labs(title = "Scatterplot with Decision Boundaries", x = "X1", y = "X2") +
theme_minimal()
# Show the plot
print(p)
In support vector machines, the line that maximizes this margin is the one we will choose as the optimal model. Support vector machines are an example of such a maximum margin estimator.
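To make this precise for the linearly separable case (with class labels coded as $y_i \in \{-1, +1\}$), the margin width is $2/||w||$, and one common way to write the hard-margin SVM problem is:
$$\large \min_{w, b} \; \frac{1}{2}||w||^2 \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1, \quad i = 1, \ldots, n$$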
Fit an SVM
# Create the SVM model
svm_model <- svm(as.factor(y) ~ X1 + X2, data = X_df, kernel = "linear", scale = FALSE)
Visualize the decision boundary with support vectors
library(e1071)
# Function to plot decision function and margins
plot_svc_decision_function <- function(svm_model, X_df) {
# Create a grid of values for X1 and X2
xlim <- range(X_df$X1)
ylim <- range(X_df$X2)
x_vals <- seq(xlim[1], xlim[2], length.out = 100)
y_vals <- seq(ylim[1], ylim[2], length.out = 100)
grid <- expand.grid(X1 = x_vals, X2 = y_vals)
# Predict decision function on the grid
decision_values <- predict(svm_model, grid, decision.values = TRUE)
decision_matrix <- attributes(decision_values)$decision
# Add decision values to the grid
grid$decision <- decision_matrix
# Create the ggplot
p <- ggplot(X_df, aes(x = X1, y = X2, color = as.factor(y))) +
geom_point(size = 4) +
# Add decision boundary and margins
stat_contour(data = grid, aes(x = X1, y = X2, z = decision),
breaks = c(-1, 0, 1),
color = "black", linetype = "dashed") + # Use single linetype
stat_contour(data = grid, aes(x = X1, y = X2, z = decision), breaks = c(0),
color = "black", linetype = "solid") +
# Plot the support vectors
geom_point(data = data.frame(svm_model$SV), aes(x = X1, y = X2), shape = 21, size = 5, stroke = 1.5, color = "blue", fill = NA) +
labs(title = "SVM Decision Boundary and Margins", x = "X1", y = "X2") +
theme_minimal()
print(p)
}
# Plot the decision function and margins
plot_svc_decision_function(svm_model, X_df)
This is the dividing line that maximizes the margin between the two sets of points. Notice that a few of the training points just touch the margin: they are indicated by the circled points in this figure. These points are the pivotal elements of this fit and are known as the support vectors, and they give the algorithm its name.
# Access support vectors
support_vectors <- svm_model$SV
# View the support vectors
support_vectors
plot_svm <- function(N = 10) {
set.seed(0)
# Generate data
X <- matrix(rnorm(400), ncol = 2)
y <- ifelse(X[,1] + X[,2] > 0, 1, 0)
# Subset the data
X <- X[1:N, ]
y <- y[1:N]
# Convert to data frame
X_df <- data.frame(X1 = X[, 1], X2 = X[, 2], y = as.factor(y))
# Train SVM
svm_model <- svm(y ~ X1 + X2, data = X_df, kernel = "linear", cost = 1e10, scale = FALSE)
# Create grid for decision boundary plot
xlim <- range(X_df$X1)
ylim <- range(X_df$X2)
x_vals <- seq(xlim[1], xlim[2], length.out = 100)
y_vals <- seq(ylim[1], ylim[2], length.out = 100)
grid <- expand.grid(X1 = x_vals, X2 = y_vals)
# Predict decision function on the grid
decision_values <- predict(svm_model, grid, decision.values = TRUE)
grid$decision <- attributes(decision_values)$decision
# Plot decision boundaries and margins
p <- ggplot(X_df, aes(x = X1, y = X2, color = y)) +
geom_point(size = 4) +
stat_contour(data = grid, aes(x = X1, y = X2, z = decision),
breaks = c(-1, 0, 1),
color = "black", linetype = "dashed") +
geom_point(data = data.frame(svm_model$SV), aes(x = X1, y = X2),
shape = 21, size = 5, stroke = 1.5, color = "blue", fill = NA) +
labs(title = paste("SVM Decision Boundary, N =", N), x = "X1", y = "X2") +
theme_minimal()
return(p)
}
# Plot for different N values
p1 <- plot_svm(N = 60)
p2 <- plot_svm(N = 120)
# Combine plots
install.packages("gridExtra")
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)
Where SVM becomes extremely powerful is when it is combined with kernels. We have seen a version of kernels before, in the basis polynomial transformations in Logistic Regression. There we projected our data into higher-dimensional space defined by feature crosses, and thereby were able to fit for nonlinear relationships with a linear classifier.
Generate a non-linearly separable dataset and fit a linear SVM
data <- make_circles(n_samples = 200, factor = 0.2, noise = 0.1)
svm_model <- svm(as.factor(label) ~ X1 + X2, data = data, kernel = "linear", cost = 1, scale = FALSE)
# Create a grid of points to evaluate the decision boundary
grid <- expand.grid(X1 = seq(min(data$X1)-1, max(data$X1)+1, length.out = 300),
X2 = seq(min(data$X2)-1, max(data$X2)+1, length.out = 300))
# Get decision values for the grid points
grid$decision <- as.numeric(attributes(predict(svm_model, grid, decision.values = TRUE))$decision.values)
# Create a ggplot scatter plot with the decision boundary and margins
ggplot(data, aes(x = X1, y = X2)) +
geom_point(aes(color = as.factor(label)), size = 2) +
geom_point(data = as.data.frame(svm_model$SV), aes(x = X1, y = X2),
shape = 21, size = 3, stroke = 1.5, color = "blue", fill = NA) +
geom_contour(data = grid, aes(x = X1, y = X2, z = decision), breaks = 0, color = "black") +
geom_contour(data = grid, aes(x = X1, y = X2, z = decision), breaks = c(-1, 1), linetype = "dashed", color = "black") +
labs(color = "label") +
theme_minimal()
get_training_results_svm <- function(data) {
y <- as.factor(data$label)
# Fit a linear SVM classifier on the two features
model <- svm(as.factor(label) ~ X1 + X2, data = data, kernel = "linear", cost = 1)
# Predict class labels on the training data
y_pred <- predict(model, data[, c("X1", "X2")])
# Confusion matrix and classification report
results <- confusionMatrix(y_pred, y, mode = "everything")
# Print classification metrics (precision, recall, F1-score, etc.)
print(results)
}
get_training_results_svm(data)
If we could find a higher dimensional space in which these points were linearly separable, then we could do the following:
There are many higher dimensional spaces in which these points are linearly separable. Here is one example
$$\large (x_1, x_2) \rightarrow (z_1, z_2, z_3): \quad z_1 = \sqrt{2}\,x_1 x_2, \quad z_2 = x_1^2, \quad z_3 = x_2^2$$
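As a quick check (a sketch that assumes the circles data frame data from the cells above is still loaded), we can compute this mapping explicitly and fit a linear SVM on the three new features; the two rings become (almost) separable by a plane in this space:
# Apply the explicit degree-2 feature map and refit a linear SVM on the mapped features
library(e1071)
Z <- data.frame(z1 = sqrt(2) * data$X1 * data$X2,
z2 = data$X1^2,
z3 = data$X2^2,
label = as.factor(data$label))
svm_mapped <- svm(label ~ z1 + z2 + z3, data = Z, kernel = "linear", cost = 1)
# Training accuracy on the mapped features
mean(predict(svm_mapped, Z) == Z$label)
Computing such a map explicitly is feasible here, but it quickly becomes expensive as the number of features and the polynomial degree grow.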
This is where the Kernel trick comes into play.
A kernel is a way of computing the dot product of two vectors x and y in some (possibly very high dimensional) feature space, which is why kernel functions are sometimes called a "generalized dot product".
The RBF and polynomial kernels are especially useful when the data points are not linearly separable.
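A small numeric check of this idea (the points below are chosen arbitrarily): for the degree-2 polynomial kernel $K(x, y) = (x^\top y)^2$, the kernel value equals the ordinary dot product of the explicitly mapped features $(x_1^2, x_2^2, \sqrt{2}\,x_1x_2)$:
# Verify that the polynomial kernel equals a dot product in the mapped feature space
phi <- function(v) c(v[1]^2, v[2]^2, sqrt(2) * v[1] * v[2]) # explicit degree-2 feature map
a <- c(1, 2)
b <- c(3, -1)
(sum(a * b))^2 # kernel value K(a, b) = (a . b)^2
sum(phi(a) * phi(b)) # dot product of the mapped features - the same value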
Fitting a radial basis function (RBF) kernel
The RBF kernel is the most general form of kernelization and one of the most widely used kernels due to its similarity to the Gaussian distribution. The RBF kernel function for two points X₁ and X₂ computes their similarity, i.e., how close they are to each other.
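Concretely, the RBF kernel is $K(x_1, x_2) = \exp(-\gamma\,||x_1 - x_2||^2)$: its value is close to 1 for nearby points and falls toward 0 as the points move apart. A minimal sketch (the points and gamma below are arbitrary):
# RBF (Gaussian) kernel between two points
rbf_kernel <- function(a, b, gamma = 1) {
exp(-gamma * sum((a - b)^2))
}
rbf_kernel(c(0, 0), c(0.1, 0.1)) # nearby points -> value near 1
rbf_kernel(c(0, 0), c(3, 3)) # distant points -> value near 0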
set.seed(0)
y <- data$label
# Fit svm model
X <- data[,c(1,2)]
clf <- svm(x = X, y = as.factor(y), kernel = "radial") # Use "radial" for RBF
# Function to get training results
get_training_results_svmrad <- function(model, X, y) {
y_pred <- predict(model, X) # Predict using the model
# Print classification report
print(confusionMatrix(as.factor(y_pred), as.factor(y))) # Confusion matrix
accuracy <- sum(y_pred == y) / length(y) # Calculate accuracy
cat("Accuracy on training set:", round(accuracy, 3), "\n") # Print accuracy
}
# Call the function with the trained model
get_training_results_svmrad(clf, X, y)
# Create a data frame for plotting
X_df <- data.frame(X1 = X[, 1], X2 = X[, 2], y = as.factor(y))
# Function to plot SVM decision function
plot_svm_decision_function <- function(model, data) {
# Create grid for decision boundary
xlim <- range(data$X1)
ylim <- range(data$X2)
grid <- expand.grid(X1 = seq(xlim[1], xlim[2], length.out = 100),
X2 = seq(ylim[1], ylim[2], length.out = 100))
# Get predictions for the grid
grid$decision <- predict(model, grid)
# Create ggplot
p <- ggplot(data, aes(x = X1, y = X2)) +
geom_point(aes(color = y), size = 4) +
geom_tile(data = grid, aes(fill = decision), alpha = 0.2) + # Use geom_tile to fill the decision area
scale_fill_manual(values = c('red', 'blue'), guide = "none") +
labs(title = "SVM Decision Function", x = "X1", y = "X2") +
theme_minimal()
return(p)
}
# Plot SVM decision function
p <- plot_svm_decision_function(clf, X_df)
print(p)
plot_svm_decision_function <- function(model, data) {
# Create grid for decision boundary
xlim <- range(data$X1)
ylim <- range(data$X2)
grid <- expand.grid(X1 = seq(xlim[1], xlim[2], length.out = 100),
X2 = seq(ylim[1], ylim[2], length.out = 100))
# Get decision values for the grid
decision_values <- predict(model, grid, decision.values = TRUE)
grid$decision <- attributes(decision_values)$decision.values
# Check if the decision values are valid
if (all(is.na(grid$decision))) {
stop("All decision values are NA. Please check your SVM model.")
}
# Create ggplot with contour lines
p <- ggplot(data, aes(x = X1, y = X2)) +
geom_point(aes(color = y), size = 4) +
geom_contour(data = grid, aes(z = decision), breaks = 0, color = "black") + # Contour line at decision boundary
geom_contour(data = grid, aes(z = decision), breaks = c(-0.5, 0.5), linetype = "dashed", color = "black") +
labs(title = "SVM Decision Function with Contours", x = "X1", y = "X2") +
theme_minimal() +
scale_color_manual(values = c('red', 'blue')) # Adjust colors as needed
return(p)
}
# Plot SVM decision function
p <- plot_svm_decision_function(clf, X_df)
print(p)
Fitting a polynomial kernel. This kernel is useful if the decision boundary can be separated by a higher-order polynomial.
clf <- svm(y ~ ., data = data.frame(X1 = X[, 1], X2 = X[, 2], y = factor(y)), kernel = "polynomial")
p <- plot_svm_decision_function(clf, data.frame(X1 = X[, 1], X2 = X[, 2], y = factor(y)))
print(p)
# Fit SVM model with polynomial kernel
clf <- svm(y ~ ., data = data.frame(X1 = X[, 1], X2 = X[, 2], y = factor(y)), kernel = "polynomial")
# Function to get training results
get_training_results <- function(model, data) {
y_pred <- predict(model, data)
# Convert probabilities to binary labels (threshold = 0.5)
#y_pred <- ifelse(y_pred_prob > 0.5, 1, 0)
confusion_matrix <- confusionMatrix(as.factor(y_pred), as.factor(data$y))
print(confusion_matrix)
accuracy <- sum(y_pred == data$y) / length(y_pred)
cat("Accuracy on training set:", round(accuracy, 3), "\n")
}
# Get training results
get_training_results(clf, data.frame(X1 = X[, 1], X2 = X[, 2], y = factor(y)))
# Function to plot SVM decision function with contours
plot_svm_decision_function <- function(model, data) {
# Create grid for decision boundary
xlim <- range(data$X1)
ylim <- range(data$X2)
grid <- expand.grid(X1 = seq(xlim[1], xlim[2], length.out = 100),
X2 = seq(ylim[1], ylim[2], length.out = 100))
# Get predictions for the grid
grid$pred <- predict(model, grid)
# Check the unique predicted values
unique_preds <- unique(grid$pred)
if (length(unique_preds) < 2) {
stop("Insufficient unique predictions for contour plotting.")
}
# Create ggplot with decision boundary
p <- ggplot(data, aes(x = X1, y = X2)) +
geom_point(aes(color = y), size = 4) +
geom_contour(data = grid, aes(z = as.numeric(pred)), breaks = 1.5, color = "black") + # Contour line between the two predicted classes
labs(title = "SVM Decision Function with Polynomial Kernel", x = "X1", y = "X2") +
theme_minimal() +
scale_color_manual(values = c('red', 'blue')) # Adjust colors as needed
return(p)
}
# Plot SVM decision function
p <- plot_svm_decision_function(clf, data.frame(X1 = X[, 1], X2 = X[, 2], y = factor(y)))
print(p)
plot(clf, data.frame(X1 = X[, 1], X2 = X[, 2], y = factor(y)))
X_df <- data.frame(X1 = X[, 1], X2 = X[, 2], y = as.factor(y))
plot_svm_decision_function_with_colors <- function(model, data) {
# Create grid for decision boundary
xlim <- range(data$X1)
ylim <- range(data$X2)
grid <- expand.grid(X1 = seq(xlim[1], xlim[2], length.out = 100),
X2 = seq(ylim[1], ylim[2], length.out = 100))
# Get predictions for the grid
grid$decision <- predict(model, grid)
grid$decision <- factor(grid$decision) # Convert predictions to factor (discrete)
# Create ggplot with color chart (legend)
p <- ggplot(data, aes(x = X1, y = X2)) +
geom_point(aes(color = y), size = 4) +
#geom_contour(data = grid, aes(z = as.numeric(decision)), breaks = c(0.5), color = "black") + # Contour line at decision boundary
#scale_color_manual(values = c('red', 'blue'), labels = c("Class 0", "Class 1")) + # Class colors
geom_raster(data = grid, aes(fill = decision), alpha = 0.3) + # Add filled decision regions
scale_fill_manual(values = c('lightblue', 'lightcoral'), labels = c("Class 0 Region", "Class 1 Region")) + # Fill colors for regions
labs(title = "SVM Decision Function with Color Chart", x = "X1", y = "X2", fill = "Regions") +
theme_minimal()
return(p)
}
# Plot SVM decision function with color chart
p <- plot_svm_decision_function_with_colors(clf, X_df)
print(p)
Another dataset
library(mlbench)
library(ggplot2)
generate_moons <- function(n = 200, noise = 0.3) {
t <- seq(0, pi, length.out = n / 2)
x1 <- c(cos(t), 1 - cos(t) + 0.5)
y1 <- c(sin(t), -sin(t) - 0.5)
X <- cbind(x1, y1)
# Add noise to the data
X <- X + matrix(rnorm(length(X), sd = noise), ncol = 2)
y <- as.factor(c(rep(0, n / 2), rep(1, n / 2)))
return(data.frame(X1 = X[,1], X2 = X[,2], y = y))
}
# Generate the dataset
moon_df <- generate_moons(n = 200, noise = 0.3)
# Custom color palette (for two classes: red and blue)
custom_colors <- c('#FF0000', '#0000FF')
# Plot the data using ggplot2
ggplot(moon_df, aes(x = X1, y = X2, color = y)) +
geom_point(size = 3) +
scale_color_manual(values = custom_colors) +
labs(title = 'Scatter Plot of make_moons-like dataset', x = 'Feature 1', y = 'Feature 2') +
theme_minimal()
svm_model <- svm(y ~ X1 + X2, data = moon_df, kernel = "linear", cost = 1, scale = FALSE) # scale = FALSE keeps support vectors in the original coordinates
moon_df
# Create a grid of points over the moons data to evaluate the decision boundary
grid <- expand.grid(X1 = seq(min(moon_df$X1) - 0.5, max(moon_df$X1) + 0.5, length.out = 300),
X2 = seq(min(moon_df$X2) - 0.5, max(moon_df$X2) + 0.5, length.out = 300))
# Predict decision values (continuous scores, not classes)
decision_values <- attributes(predict(svm_model, grid, decision.values = TRUE))$decision.values
# Check if decision_values contains a meaningful range
cat("Range of decision values: ", range(decision_values), "\n")
# Add decision values to the grid data
grid_data <- cbind(grid, decision_values = as.numeric(decision_values))
# Plot the data, support vectors, decision boundary, and margins
ggplot(moon_df, aes(x = X1, y = X2)) +
geom_point(aes(color = y), size = 2) +
geom_point(data = as.data.frame(svm_model$SV), aes(x = X1, y = X2),
shape = 21, size = 3, stroke = 1.5, color = "blue", fill = NA) +
geom_contour(data = grid_data, aes(x = X1, y = X2, z = decision_values),
breaks = 0, color = "black") + # Decision boundary (where decision_values = 0)
geom_contour(data = grid_data, aes(x = X1, y = X2, z = decision_values),
breaks = c(-0.5, 0.5), linetype = "dashed", color = "black") + # Margins
labs(color = "y values") +
theme_minimal()
get_training_results_svm <- function(data) {
y <- data$y
# Fit a linear SVM classifier on the two features
model <- svm(y ~ X1 + X2, data = data, kernel = "linear", cost = 1)
# Predict class labels on the training data
y_pred <- predict(model, data[, c("X1", "X2")])
# Confusion matrix and classification report
results <- confusionMatrix(as.factor(y_pred), as.factor(y), mode = "everything")
# Print classification metrics (precision, recall, F1-score, etc.)
print(results)
}
get_training_results_svm(moon_df)
# Fit SVM model with polynomial kernel
clf <- svm(y ~ ., data = moon_df, kernel = "polynomial")
# Get training results
get_training_results(clf, moon_df)
# Function to plot SVM decision function with contours
# Plot SVM decision function
p <- plot_svm_decision_function(clf, moon_df)
print(p)
plot_svm_decision_function <- function(model, data) {
# Create grid for decision boundary
xlim <- range(data$X1)
ylim <- range(data$X2)
grid <- expand.grid(X1 = seq(xlim[1], xlim[2], length.out = 100),
X2 = seq(ylim[1], ylim[2], length.out = 100))
# Get decision values for the grid
decision_values <- predict(model, grid, decision.values = TRUE)
grid$decision <- attributes(decision_values)$decision.values
# Check if the decision values are valid
if (all(is.na(grid$decision))) {
stop("All decision values are NA. Please check your SVM model.")
}
# Create ggplot with contour lines
p <- ggplot(data, aes(x = X1, y = X2)) +
geom_point(aes(color = y), size = 4) +
geom_contour(data = grid, aes(z = decision), breaks = 0, color = "black") + # Contour line at decision boundary
geom_contour(data = grid, aes(z = decision), breaks = c(-0.5, 0.5), linetype = "dashed", color = "black") +
labs(title = "SVM Decision Function with Contours", x = "X1", y = "X2") +
theme_minimal() +
scale_color_manual(values = c('red', 'blue')) # Adjust colors as needed
return(p)
}
moon_X <- moon_df[,c(1,2)]
moon_y <- moon_df[,3]
clf <- svm(x = moon_X, y = as.factor(moon_y), kernel = "radial") # Use "radial" for RBF
p <- plot_svm_decision_function(clf, moon_df)
print(p)
get_training_results_svmrad(clf, moon_X, moon_y)
p <- plot_svm_decision_function_with_colors(clf, moon_df)
print(p)
The END
Authors: Dr. Samir Gupta, Dr. Matthew McCoy, Jia Li Dong (M.S.) & ICBI AIM-AHEAD Team
